WIP: Agent Skills Updates From Live Trials by chadvoegele · Pull Request #1493 · NVIDIA/Model-Optimizer

chadvoegele · 2026-05-14T16:43:06Z

What does this PR do?

Type of change: bug fix

Usage

Ask Claude Code:

Quantize `mistralai/Mistral-Medium-3.5-128B` to NVFP4 using the ModelOpt NVFP4 experts-only recipe.

Run on $cluster

Evaluate the resulting quantized checkpoint on:
- GPQA Diamond AA v3
- SciCode AA v2

Complete the quantization and evaluation workflow end to end. Prompt when you require user input, otherwise keep going.

Testing

I'm running the full loop with the above prompt, and iterating on skills to resolve undesired agent behavior.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅
Did you write any new necessary tests?: N/A
Did you update Changelog?: TODO
Did you get Claude approval on this PR?: TODO

Additional Information

See trials log for details.

Summary by CodeRabbit

Release Notes

New Features
- Added SLURM Quality of Service (QoS) configuration support for job submission
- Introduced new evaluation task recipes (AIME 2025, GPQA, IFBench, LiveCodeBench, SciCode, AA-LCR, HLE-AA, MMMU-Pro, tau2_bench)
- Enhanced job monitoring with continuous polling-based tracking

Documentation
- Restructured evaluation workflow with explicit dry-run, canary, and full-run validation stages
- Expanded PTQ validation with mandatory pre-deployment verification gates
- Updated remote cluster selection and quantization detection guidance

Tests
- Updated evaluation test expectations to reflect refined workflow stages

coderabbitai · 2026-05-14T16:43:19Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

This PR enhances evaluation workflows with improved quantization detection and baseline comparison requirements, standardizes task configurations into Markdown-based recipes with Python extraction helpers, introduces PTQ post-quantization validation gates, adds job monitoring via durable polling, and enables SLURM Quality of Service configuration support.

Changes

Evaluation Workflow Enhancement

| Layer / File(s) | Summary |
| --- | --- |
| Quantization Detection, Deployment Preference, and Baseline Comparison `.claude/skills/evaluation/SKILL.md` | Rewrites ModelOpt quantization auto-detection to check `config.json` first, fall back to `hf_quant_config.json`, conditionally apply vLLM `--quantization` flags, and prefer vLLM for NEL self-deployment. Adds baseline-comparison preflight requiring baseline identification and matching infrastructure setup before accepting quantized score deltas. |
| Task Configuration Refinement and Container Registry Authentication `.claude/skills/evaluation/SKILL.md` | Updates task confirmation to prefer `recipes/tasks/` reference documents and re-display the updated task list before final confirmation. Expands SLURM registry-auth into a decision flow for public vs. access-restricted images with conditional credential verification and single-retry limits. |
| Execution Gating, Step Restructuring, and Result Verification `.claude/skills/evaluation/SKILL.md`, `.claude/skills/launching-evals/references/analyze-results.md` | Restructures workflow to add Step 7.5 (registry auth), split Step 8 into 8.1/8.2/8.3 (dry-run, canary, full evaluation), and introduce Steps 9–10 for run verification and baseline-vs-quantized comparability. Strengthens validation with expanded NEL timeout/resume behavior guidance and broadened diagnostics grep patterns. |
| Evaluation Test Expectations Updates `.claude/skills/evaluation/tests/evals.json` | Updates three evaluation scenarios to reflect new quantization detection logic, conditional vLLM `--quantization` flag behavior, and enhanced evaluation flow expectations including canary log validation and explicit baseline-vs-quantized comparability verification. |
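
The detection order summarized in the first row (check `config.json` first, then fall back to `hf_quant_config.json`) could be sketched roughly as follows. This is only an illustration: the helper name `detect_quant_algo` and the JSON keys it reads (`quantization_config.quant_algo` / `quant_method` and `quantization.quant_algo`) are assumptions about the checkpoint layout, not text taken from the skill file.

```python
import json
from pathlib import Path
from typing import Optional


def _read_json(path: Path) -> dict:
    """Return the parsed JSON object, or an empty dict if the file is absent."""
    if not path.is_file():
        return {}
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


def detect_quant_algo(checkpoint_dir: str) -> Optional[str]:
    """Hypothetical helper: look in config.json first, then hf_quant_config.json.

    The key names used below are assumptions about the checkpoint layout.
    """
    root = Path(checkpoint_dir)

    # 1. Preferred source: a quantization_config block inside config.json.
    qcfg = _read_json(root / "config.json").get("quantization_config") or {}
    algo = qcfg.get("quant_algo") or qcfg.get("quant_method")
    if algo:
        return str(algo)

    # 2. Fallback: the standalone hf_quant_config.json file.
    legacy = _read_json(root / "hf_quant_config.json").get("quantization") or {}
    algo = legacy.get("quant_algo")
    return str(algo) if algo else None
```

A caller could then decide whether to pass a vLLM `--quantization` flag only when this returns a value; how the returned string maps to a flag value is left open here.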

Task Recipe Standardization and Configuration Updates

| Layer / File(s) | Summary |
| --- | --- |
| Core Benchmark Task Recipes `.claude/skills/evaluation/recipes/tasks/aime2025.md`, `gpqa.md`, `mmlu_pro.md`, `scicode.md`, `scicode.yaml` (removed) | Creates comprehensive Markdown task recipes with metadata, YAML fragments for `evaluation.tasks`, and Python-based score extraction logic. AIME 2025 defines 16-repeat symbolic reasoning; GPQA implements 16-sample pass@1[avg-of-N] with stderr; SciCode includes deployment constraints and dual-group score extraction. Removes corresponding YAML recipe files. |
| Additional Benchmark Task Recipes `.claude/skills/evaluation/recipes/tasks/ifbench.md`, `livecodebench.md`, `mmmu_pro.md`, `aa_lcr.md`, `hle_aa_v2.md`, `tau2_bench_telecom.md`, and removed `*.yaml` files | Adds recipes for IFBench (8-repeat prompt_loose_accuracy), LiveCodeBench v6 (dataset split/retry config), MMMU-Pro (multimodal symbolic_correct), AA-LCR (judge-backed long-context), HLE-AA-v2 (judge evaluation), and Tau2-Bench Telecom (user-simulator endpoint). Removes corresponding YAML configurations. |
| Example Evaluation Configuration and Environment Updates `.claude/skills/evaluation/recipes/examples/example_eval.yaml`, `env.example` | Updates `example_eval.yaml` to describe task references as providing benchmark requirements and YAML fragments for composition, changes "Smoke test" to "Canary", and corrects MMLU-Pro identifier to `nemo_skills.ns_mmlu_pro`. Revises `env.example` to generalize `JUDGE_API_KEY` for multiple judge-backed tasks and introduce `USER_API_KEY` for tau2_bench_telecom. |

PTQ Validation Framework

| Layer / File(s) | Summary |
| --- | --- |
| PTQ Preflight Checks and Post-Quantization Validation Gate `.claude/skills/ptq/SKILL.md` | Adds preflight step to inspect recipe coverage patterns before calibration and flag partial de-quantization risks. Transforms post-quantization validation into a mandatory pre-deployment gate directing to checkpoint-validation.md with required reporting of size ratios, precision counts, and metadata diffs. |
| Checkpoint Validation Framework and Scripts `.claude/skills/ptq/references/checkpoint-validation.md` | Expands checkpoint validation into explicit pre-deployment gate with size/bits reduction verification (blocking compression recipes with ratio >= 1.0), linear-layer quantization coverage validation (detecting config mismatches via precision counts), and non-weight metadata consistency checking. Provides bash-invoked Python size-check snippet and comprehensive layer coverage script reading safetensors index and `hf_quant_config.json`. |
| PTQ Test Scenario with Validation Gating `.claude/skills/ptq/tests.json` | Adds evaluation scenario for FP8→partial NVFP4 quantization with mandatory checkpoint-validation gating including size/ratio reporting, layer precision coverage checks, and conditional stopping before eval submission if validation fails. |
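
The size gate mentioned above (a compression recipe is blocked when the quantized-to-baseline ratio is >= 1.0) might look something like the sketch below. The function names are hypothetical and the real snippet in `checkpoint-validation.md` may measure size differently; this version simply sums the `*.safetensors` shards found directly in each directory.

```python
from pathlib import Path


def weights_size_bytes(checkpoint_dir: str) -> int:
    """Sum the sizes of all *.safetensors shards directly inside a checkpoint directory."""
    return sum(p.stat().st_size for p in Path(checkpoint_dir).glob("*.safetensors"))


def size_ratio(quantized_bytes: int, baseline_bytes: int) -> float:
    """Quantized size divided by baseline size; values below 1.0 mean compression."""
    if baseline_bytes <= 0:
        raise ValueError("baseline checkpoint has no weight bytes to compare against")
    return quantized_bytes / baseline_bytes


def passes_size_gate(quantized_bytes: int, baseline_bytes: int) -> bool:
    """A compression recipe passes only if the quantized weights are strictly smaller."""
    return size_ratio(quantized_bytes, baseline_bytes) < 1.0
```

For example, `passes_size_gate(weights_size_bytes(quantized_dir), weights_size_bytes(baseline_dir))` would return `False` for a checkpoint that did not shrink, which is the case the gate is meant to stop before deployment.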

Monitoring, Baseline Verification, and Infrastructure

| Layer / File(s) | Summary |
| --- | --- |
| Durable Job Monitor Loop with Expanded Terminal States `.claude/skills/monitor/SKILL.md` | Replaces 15-minute recurring cron with durable monitor loop that continuously polls `.claude/active_jobs.json`, checks each registered job until terminal states, emits state-change events, removes terminal jobs, and exits when registry is empty. Expands terminal state set and updates raw SLURM termination detection from `squeue` to `sacct`-based check with filtering rules. |
| Baseline Comparison Verification and Multi-Cluster Selection `.claude/skills/launching-evals/references/analyze-results.md`, `.claude/skills/common/environment-setup.md` | Adds checklist item requiring baseline comparison for quantized runs against matching baseline using consistent benchmark/task/infrastructure. Updates multi-cluster selection to require explicit user prompt instead of silently defaulting to `default_cluster`. |
| SLURM Quality of Service Configuration Support `tools/launcher/slurm_config.py`, `tools/launcher/core.py` | Adds `qos` field to `SlurmConfig` dataclass, extends `slurm_factory` to accept `qos` parameter from `SLURM_QOS` environment variable, and passes `qos` into `run.SlurmExecutor` construction, enabling Quality of Service configuration for submitted jobs. |
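
As a rough illustration of the QoS change in the last row, the sketch below shows an optional `qos` field that falls back to the `SLURM_QOS` environment variable. It is a simplified stand-in, not the launcher's actual code: the `account` and `partition` fields and the `sbatch_flags` helper are assumptions, and the real code passes `qos` to `run.SlurmExecutor` rather than building flags by hand.

```python
import os
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SlurmConfig:
    """Illustrative stand-in; the real dataclass has other fields."""

    account: str
    partition: str
    qos: Optional[str] = None  # optional Quality of Service name


def slurm_factory(account: str, partition: str, qos: Optional[str] = None) -> SlurmConfig:
    """Use the explicit qos argument if given, otherwise the SLURM_QOS env var."""
    resolved = qos if qos is not None else (os.environ.get("SLURM_QOS") or None)
    return SlurmConfig(account=account, partition=partition, qos=resolved)


def sbatch_flags(cfg: SlurmConfig) -> List[str]:
    """Hypothetical helper: render the config as sbatch command-line flags."""
    flags = [f"--account={cfg.account}", f"--partition={cfg.partition}"]
    if cfg.qos:
        flags.append(f"--qos={cfg.qos}")  # omitted entirely when no QoS is set
    return flags
```

Keeping `qos` optional means existing submissions that never set `SLURM_QOS` produce the same flags as before.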

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

ChenhanYu
kaix-nv
meenchen
mxinO
shengliangxu
realAsma

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Security Anti-Patterns | ❌ Error | PR adds tools/launcher/core.py with `# nosec` comments (B404, B603, B607) without justification or codeowner approval as required by SECURITY.md. | Replace `# nosec` with inline comments explaining subprocess safety (hardcoded git commands, no shell=True). Request `@NVIDIA/modelopt-setup-codeowners` review with justification in PR description. |
| Title check | ❓ Inconclusive | The PR title 'WIP: Agent Skills Updates From Live Trials' is generic and vague; it uses 'WIP' and 'Updates From Live Trials' without clearly specifying what the primary change is, making it unclear what aspect of skills was modified. | Replace with a more specific title that highlights the primary change, such as 'Add evaluation task recipes (AIME, GPQA, IFBench, SciCode, AA-LCR, HLE, MMMU, tau2) and refactor PTQ validation workflow'. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/monitor/SKILL.md:
- Around line 54-59: The documentation/logic currently enforces "report only
state changes" universally; update it so that user-initiated checks (e.g., when
the user explicitly asks "check status") return the full current status for each
job rather than only deltas—leave monitor-driven checks to still compare against
`last_status` in `.claude/active_jobs.json` and report only changes. Adjust the
wording and any associated pseudocode/implementation notes to branch on the
trigger type ("monitor output" vs "user-initiated") and on user-initiated flows
ensure you read the registry, check each job, return current state for each job,
and then update `last_status` accordingly. Ensure references to `last_status`
and `.claude/active_jobs.json` remain consistent so maintainers can find and
implement the conditional behavior.
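
The branching this comment asks for could be sketched as below. It is a minimal illustration under assumptions: the registry is modeled as a plain dict shaped like `{job_id: {"last_status": ...}}`, which may not match the actual layout of `.claude/active_jobs.json`, and `report_jobs` and `check_job` are hypothetical names.

```python
from typing import Callable, Dict, List


def report_jobs(
    registry: Dict[str, dict],
    check_job: Callable[[str], str],
    user_initiated: bool,
) -> List[str]:
    """Return report lines and update each job's last_status in place."""
    lines = []
    for job_id, entry in registry.items():
        current = check_job(job_id)
        previous = entry.get("last_status")
        if user_initiated:
            # The user asked explicitly: always show the full current state.
            lines.append(f"{job_id}: {current}")
        elif current != previous:
            # Monitor-driven check: report only state changes.
            lines.append(f"{job_id}: {previous} -> {current}")
        # Both paths record the latest state so later deltas stay correct.
        entry["last_status"] = current
    return lines
```

Updating `last_status` on both paths matters: if a user-initiated check skipped the update, the next monitor pass would report a change the user has already seen.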

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 27a4dd3f-7246-45c7-8699-ec80a318c50a

📥 Commits

Reviewing files that changed from the base of the PR and between 229ba61 and ddf58b5.

📒 Files selected for processing (11)

.claude/skills/common/environment-setup.md
.claude/skills/evaluation/SKILL.md
.claude/skills/evaluation/tests/evals.json
.claude/skills/launching-evals/references/analyze-results.md
.claude/skills/monitor/SKILL.md
.claude/skills/ptq/SKILL.md
.claude/skills/ptq/references/checkpoint-validation.md
.gitignore
modelopt/torch/quantization/model_quant.py
tools/launcher/core.py
tools/launcher/slurm_config.py

github-actions · 2026-05-14T16:48:00Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1493/
Built to branch `gh-pages` at 2026-05-14 16:47 UTC. Preview will be ready when the GitHub Pages deployment is complete.

codecov · 2026-05-14T16:59:09Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.90%. Comparing base (a5bc6f8) to head (2595c72).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1493      +/-   ##
==========================================
+ Coverage   76.79%   76.90%   +0.11%     
==========================================
  Files         474      474              
  Lines       51560    51875     +315     
==========================================
+ Hits        39593    39893     +300     
- Misses      11967    11982      +15
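
The percentages in the diff above follow directly from the hit and line counts. A quick check of the arithmetic, using only the numbers reported in the table (the function name is just for illustration):

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage of covered lines."""
    return 100.0 * hits / lines


base = coverage_pct(39593, 51560)  # main
head = coverage_pct(39893, 51875)  # this PR

# Misses are simply lines minus hits: 51560 - 39593 and 51875 - 39893.
base_misses = 51560 - 39593
head_misses = 51875 - 39893

print(f"{base:.2f}% -> {head:.2f}% ({head - base:+.2f}%)")  # 76.79% -> 76.90% (+0.11%)
print(base_misses, head_misses)  # 11967 11982
```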

| Flag | Coverage Δ | |
| --- | --- | --- |
| regression | `15.21% <ø> (+0.07%)` | ⬆️ |
| unit | `52.64% <ø> (+<0.01%)` | ⬆️ |

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.claude/skills/evaluation/SKILL.md (1)

203-212: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix task snippet schema mismatch (tasks vs evaluation.tasks).

The Step 5 example contradicts earlier instructions to edit evaluation.tasks. Keeping tasks: here can make generated configs invalid or ignored.

Suggested fix

-  tasks:
-    - name: <task>
-      nemo_evaluator_config:
-        config:
-          params:
-            temperature: <value>
-            max_new_tokens: <value>
-            ...
+  evaluation:
+    tasks:
+      - name: <task>
+        nemo_evaluator_config:
+          config:
+            params:
+              temperature: <value>
+              max_new_tokens: <value>
+              ...

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/evaluation/SKILL.md around lines 203 - 212, The YAML example
uses a top-level "tasks:" key which conflicts with the expected
"evaluation.tasks" namespace; update the snippet so the tasks list is nested
under "evaluation.tasks" (i.e., move "tasks:" under an "evaluation:" key and
keep the existing task entries like "name" and "nemo_evaluator_config" intact),
and verify any references to "tasks" in the surrounding text or examples are
corrected to "evaluation.tasks" to keep schema consistent.
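
One way to catch this kind of mismatch before submission is a small structural check on the loaded config. The sketch below works on an already-parsed dict and is only an illustration; `find_task_schema_problems` is a hypothetical helper, and the rule that each task needs a `name` is inferred from the snippet above rather than from a published schema.

```python
from typing import Any, Dict, List


def find_task_schema_problems(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems with where the task list lives."""
    problems = []

    # A top-level "tasks" key is the mistake flagged in the review comment.
    if "tasks" in config:
        problems.append("top-level 'tasks' found; move it under 'evaluation.tasks'")

    evaluation = config.get("evaluation")
    tasks = evaluation.get("tasks") if isinstance(evaluation, dict) else None
    if not isinstance(tasks, list) or not tasks:
        problems.append("'evaluation.tasks' must be a non-empty list")
    else:
        for i, task in enumerate(tasks):
            if not isinstance(task, dict) or "name" not in task:
                problems.append(f"evaluation.tasks[{i}] is missing 'name'")
    return problems
```

An empty return value means the task list sits where the workflow expects it; anything else can be shown to the user before the dry-run stage.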

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1c2365f0-8719-4c16-9dd8-7640d57849f7

📥 Commits

Reviewing files that changed from the base of the PR and between ddf58b5 and 0633922.

📒 Files selected for processing (18)

.claude/skills/evaluation/SKILL.md
.claude/skills/evaluation/recipes/examples/example_eval.yaml
.claude/skills/evaluation/recipes/tasks/aime2025.md
.claude/skills/evaluation/recipes/tasks/aime2025.yaml
.claude/skills/evaluation/recipes/tasks/gpqa.md
.claude/skills/evaluation/recipes/tasks/gpqa.yaml
.claude/skills/evaluation/recipes/tasks/ifbench.md
.claude/skills/evaluation/recipes/tasks/ifbench.yaml
.claude/skills/evaluation/recipes/tasks/livecodebench.md
.claude/skills/evaluation/recipes/tasks/livecodebench.yaml
.claude/skills/evaluation/recipes/tasks/mmlu_pro.md
.claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
.claude/skills/evaluation/recipes/tasks/scicode.md
.claude/skills/evaluation/recipes/tasks/scicode.yaml
.claude/skills/evaluation/tests/evals.json
.claude/skills/ptq/SKILL.md
.claude/skills/ptq/references/checkpoint-validation.md
.claude/skills/ptq/tests.json

💤 Files with no reviewable changes (6)

.claude/skills/evaluation/recipes/tasks/ifbench.yaml
.claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
.claude/skills/evaluation/recipes/tasks/aime2025.yaml
.claude/skills/evaluation/recipes/tasks/scicode.yaml
.claude/skills/evaluation/recipes/tasks/gpqa.yaml
.claude/skills/evaluation/recipes/tasks/livecodebench.yaml

✅ Files skipped from review due to trivial changes (3)

.claude/skills/evaluation/recipes/examples/example_eval.yaml
.claude/skills/evaluation/recipes/tasks/ifbench.md
.claude/skills/evaluation/recipes/tasks/mmlu_pro.md

🚧 Files skipped from review as they are similar to previous changes (1)

.claude/skills/evaluation/tests/evals.json

coderabbitai · 2026-05-14T19:21:59Z

+def extract_gpqa_score(path, repeats=None):
+    data = yaml.safe_load(open(path))
+    metrics = data["results"]["groups"]["gpqa"]["metrics"]


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add basic argument validation and safe file loading in the extractor.

This snippet can throw IndexError if invoked without args and should avoid raw open(...) in yaml.safe_load.

Suggested fix

 def extract_gpqa_score(path, repeats=None):
-    data = yaml.safe_load(open(path))
+    with open(path, "r", encoding="utf-8") as f:
+        data = yaml.safe_load(f)
@@
 if __name__ == "__main__":
-    path = sys.argv[1]
+    if len(sys.argv) < 2:
+        raise SystemExit("Usage: python extract_gpqa_score.py <results.yaml> [repeats]")
+    path = sys.argv[1]
     repeats = int(sys.argv[2]) if len(sys.argv) > 2 else None
     print(extract_gpqa_score(path, repeats))

Also applies to: 94-97

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.claude/skills/evaluation/recipes/tasks/gpqa.md around lines 73 - 75, The extractor extract_gpqa_score currently can raise IndexError when called without args and uses raw open(...) with yaml.safe_load; add basic argument validation (ensure path is provided and repeats, if given, is an int) and use a safe file context (with open(path, "r") as f) and yaml.safe_load(f) while catching FileNotFoundError and yaml.YAMLError and re-raising a clear ValueError; also validate that the expected keys exist in the loaded dict (results -> groups -> gpqa -> metrics) and raise ValueError if missing. Apply the same validation and safe-loading pattern to the similar extractor function around lines 94-97 to ensure consistent error handling.
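
A fuller version of the hardening described in this prompt might look like the sketch below. It is an illustration rather than the recipe's actual extractor: `load_results` and `group_metrics` are hypothetical names, PyYAML is assumed to be installed, and the function stops at returning the `metrics` mapping because the metric key names are not shown in this thread.

```python
from collections.abc import Mapping
from typing import Any


def load_results(path: str) -> Any:
    """Parse a results YAML file, turning I/O and parse failures into ValueError."""
    import yaml  # PyYAML; imported lazily so group_metrics has no third-party dependency

    try:
        with open(path, "r", encoding="utf-8") as f:
            return yaml.safe_load(f)
    except FileNotFoundError as exc:
        raise ValueError(f"results file not found: {path}") from exc
    except yaml.YAMLError as exc:
        raise ValueError(f"could not parse YAML in {path}: {exc}") from exc


def group_metrics(data: Any, group: str = "gpqa") -> Mapping:
    """Walk results -> groups -> <group> -> metrics, failing clearly on a missing level."""
    node = data
    for key in ("results", "groups", group, "metrics"):
        if not isinstance(node, Mapping) or key not in node:
            raise ValueError(f"results are missing the '{key}' level for group '{group}'")
        node = node[key]
    return node
```

Separating file loading from key walking keeps the validation testable on plain dicts, for example `group_metrics(load_results("results.yaml"))`.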

coderabbitai · 2026-05-14T19:21:59Z

+def extract_score(path, group="scicode"):
+    spec = TASKS[group]
+    data = yaml.safe_load(open(path))
+    metrics = data["results"]["groups"][group]["metrics"]


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Harden CLI/file handling in the score extractor snippet.

The snippet crashes with IndexError when no path is passed, and it opens the YAML file without a context manager.

Suggested fix

 def extract_score(path, group="scicode"):
     spec = TASKS[group]
-    data = yaml.safe_load(open(path))
+    with open(path, "r", encoding="utf-8") as f:
+        data = yaml.safe_load(f)
@@
 if __name__ == "__main__":
-    path = sys.argv[1]
+    if len(sys.argv) < 2:
+        raise SystemExit("Usage: python extract_score.py <results.yaml> [scicode|gpqa]")
+    path = sys.argv[1]
     group = sys.argv[2] if len(sys.argv) > 2 else "scicode"
     print(extract_score(path, group))

Also applies to: 135-138

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.claude/skills/evaluation/recipes/tasks/scicode.md around lines 105 - 108, The extract_score function currently assumes a valid path and opens the YAML without a context manager; fix it by validating the path argument (raise ValueError or return a clear error if path is falsy), check the file exists (catch FileNotFoundError), and read the YAML using a context manager (with open(path) as f: data = yaml.safe_load(f)); then safely access TASKS[group] and data["results"]["groups"][group]["metrics"] (use .get or catch KeyError to provide a clearer error). Apply the same changes to the other identical snippet that reads the YAML and accesses metrics.
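
For the command-line half of this comment, an alternative to checking `len(sys.argv)` by hand is to let `argparse` produce the usage error. The sketch below is one possible shape and not the recipe's code; the group names are taken from the `[scicode|gpqa]` usage string in the suggestion, and `parse_args` is a hypothetical helper.

```python
import argparse
from typing import List, Optional

GROUPS = ("scicode", "gpqa")  # mirrors the [scicode|gpqa] usage string in the suggestion


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse '<results.yaml> [group]'; argparse exits with a usage message on bad input."""
    parser = argparse.ArgumentParser(
        description="Extract a benchmark score from a results YAML file."
    )
    parser.add_argument("path", help="path to the results YAML file")
    parser.add_argument(
        "group",
        nargs="?",  # optional positional argument
        default="scicode",
        choices=GROUPS,
        help="score group to extract (default: scicode)",
    )
    return parser.parse_args(argv)
```

Calling the script with no arguments then exits with a usage message instead of an `IndexError`, and an unknown group is rejected before the file is opened.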

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 466a2d74-5bf4-40f5-a94e-78d17d191a59

📥 Commits

Reviewing files that changed from the base of the PR and between 0633922 and 29bbb20.

📒 Files selected for processing (7)

.claude/skills/evaluation/SKILL.md
.claude/skills/evaluation/recipes/env.example
.claude/skills/evaluation/recipes/tasks/aa_lcr.md
.claude/skills/evaluation/recipes/tasks/aime2025.md
.claude/skills/evaluation/recipes/tasks/hle_aa.md
.claude/skills/evaluation/recipes/tasks/ifbench.md
.claude/skills/evaluation/recipes/tasks/mmlu_pro_aa_v3.md

✅ Files skipped from review due to trivial changes (6)

.claude/skills/evaluation/recipes/env.example
.claude/skills/evaluation/recipes/tasks/mmlu_pro_aa_v3.md
.claude/skills/evaluation/recipes/tasks/aa_lcr.md
.claude/skills/evaluation/recipes/tasks/hle_aa.md
.claude/skills/evaluation/recipes/tasks/ifbench.md
.claude/skills/evaluation/recipes/tasks/aime2025.md

coderabbitai · 2026-05-15T22:35:05Z

 ```bash
 ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
 ```

 **Decision flow (check before submitting):**
-1. Check if the cluster has credentials for the default DockerHub image (see command above)
-2. If DockerHub credentials exist → use the default image and submit
-3. If DockerHub credentials are missing but can be added → add them (see `slurm-setup.md` section 6), then submit
-4. If DockerHub credentials cannot be added → override `deployment.image` to the NGC alternative and submit:
+1. If the selected images are public → submit without an auth preflight
+2. If any selected image is private or access-restricted → check for credentials for that image's registry (see command above)
+3. If credentials exist → use the selected image and submit


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make credential verification registry-specific.

Line 261 currently checks whether any credential entry exists, not whether credentials exist for the selected registry host(s). That can pass preflight but still fail image pulls.

Suggested doc fix

-ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
+ssh <host> "awk '/^\s*machine\s+/ {print \$2}' ~/.config/enroot/.credentials 2>/dev/null"
+# Verify the required registry host(s) from selected images are present (e.g., docker.io, nvcr.io, registry.internal).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.claude/skills/evaluation/SKILL.md around lines 260 - 267, The preflight check currently tests for any credential entry using the generic grep command; change it to verify credentials per registry host used by the selected images by searching for the specific registry hostnames (not just any "machine" entry) in ~/.config/enroot/.credentials. Update the documented check (the grep invocation shown) to demonstrate matching the actual registry host(s) (e.g., loop or run grep for each selected image's registry host) so the preflight returns true only when credentials exist for those specific hosts.
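
The per-registry check requested here could be sketched in Python as follows, operating on the text of the credentials file once it has been fetched. Several things are assumptions: the file is treated as netrc-style `machine <host> ...` lines (as the `grep` pattern implies), the registry host is guessed from the image reference with a simple heuristic, and `docker.io` as the default host for unqualified images may not match the host name actually stored for Docker Hub.

```python
import re
from typing import Iterable, Set


def credential_hosts(credentials_text: str) -> Set[str]:
    """Collect host names from netrc-style 'machine <host> ...' lines."""
    return set(re.findall(r"^\s*machine\s+(\S+)", credentials_text, flags=re.MULTILINE))


def registry_host(image: str, default: str = "docker.io") -> str:
    """Guess the registry host of an image reference (heuristic, see assumptions above)."""
    first = image.split("/", 1)[0]
    # A leading component with a dot, a port, or 'localhost' is treated as a registry host.
    if "/" in image and ("." in first or ":" in first or first == "localhost"):
        return first
    return default


def missing_registry_hosts(images: Iterable[str], credentials_text: str) -> Set[str]:
    """Registry hosts needed by the selected images that have no credential entry."""
    known = credential_hosts(credentials_text)
    return {registry_host(img) for img in images} - known
```

The preflight would pass only when `missing_registry_hosts(...)` is empty for the access-restricted images, rather than whenever any `machine` entry exists.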

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.claude/skills/evaluation/recipes/tasks/tau2_bench_telecom.md (1)

38-39: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Complete the Score Extraction section.

The Score Extraction section contains only a header with no content. Users need guidance on how to extract and interpret the pass_1 metric for this task. Please add content similar to other task recipes (e.g., AIME 2025, GPQA) that explains which metric to use and how to extract it from the evaluation results.

Do you want me to help draft the score extraction guidance based on the primary metric pass_1 and the tau2_bench harness documentation?

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/evaluation/recipes/tasks/tau2_bench_telecom.md around lines
38 - 39, The Score Extraction section is empty—add a short paragraph stating
that the primary metric is pass_1 and instruct users to extract the pass_1 value
from the tau2_bench evaluation results (e.g., from the metrics or results JSON
under the "pass_1" key), report it as a percentage (multiply by 100 if the
harness returns a fraction), and include any aggregation used (mean across seeds
or runs). Reference the task name tau2_bench_telecom and the harness docs for
exact JSON field names and show that the reported score should be the aggregated
pass_1 value used for comparisons.
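
The aggregation this prompt describes (mean of `pass_1` across runs, reported as a percentage) could be sketched as below. The function names are hypothetical, and the fraction test is a heuristic assumption: any value in [0, 1] is treated as a fraction, so a harness that already reports percentages at or below 1 would be scaled incorrectly.

```python
from statistics import mean
from typing import Sequence


def to_percent(value: float) -> float:
    """Treat values in [0, 1] as fractions and scale them; leave larger values as-is."""
    return value * 100.0 if 0.0 <= value <= 1.0 else value


def aggregate_pass_1(run_scores: Sequence[float]) -> float:
    """Mean pass_1 across runs or seeds, expressed as a percentage."""
    if not run_scores:
        raise ValueError("need at least one pass_1 value")
    return mean(to_percent(s) for s in run_scores)
```

For instance, two runs with `pass_1` of 0.5 and 0.25 would be reported as 37.5. The exact result field to read `pass_1` from should come from the harness documentation, as the comment notes.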

♻️ Duplicate comments (1)

.claude/skills/evaluation/SKILL.md (1)

260-262: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make credential verification registry-specific.

The grep command checks for any credential entry, not credentials for the specific registry host(s) used by the selected images. This can pass the preflight but still fail image pulls if credentials for the required registries are missing.

📝 Suggested fix

-ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
+ssh <host> "awk '/^\s*machine\s+/ {print \$2}' ~/.config/enroot/.credentials 2>/dev/null"
+# Verify the required registry host(s) from selected images are present (e.g., docker.io, nvcr.io, registry.internal).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/evaluation/SKILL.md around lines 260 - 262, The current
preflight uses the command string ssh <host> "grep -E '^\s*machine\s+'
~/.config/enroot/.credentials 2>/dev/null" which only checks for any credential
entry; change it to verify registry-specific credentials by extracting registry
hostnames from the selected images and running grep for each host (e.g., grep -E
"^\s*machine\s+<registryHost>\b" ~/.config/enroot/.credentials) or equivalent
per-host checks over SSH; update the code that emits the ssh grep command in
SKILL.md to iterate the image registry list and fail the preflight if any
registryHost lookup returns no match.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 59-68: Update the ambiguous phrase "Do Step 3, Step 4, then Step
7.5/8" in the SKILL.md shortcut section to explicitly state which Step 8
substeps apply; replace it with something like "Do Step 3, Step 4, Step 7.5,
then Step 8 (complete the applicable substeps 8.1/8.2/8.3)" or "Do Step 3, Step
4, Step 7.5, then Step 8.1–8.3 as applicable" so readers aren’t confused by the
restructured Step 8; locate the exact string "Do Step 3, Step 4, then Step
7.5/8" and update it accordingly.

---

Outside diff comments:
In @.claude/skills/evaluation/recipes/tasks/tau2_bench_telecom.md:
- Around line 38-39: The Score Extraction section is empty—add a short paragraph
stating that the primary metric is pass_1 and instruct users to extract the
pass_1 value from the tau2_bench evaluation results (e.g., from the metrics or
results JSON under the "pass_1" key), report it as a percentage (multiply by 100
if the harness returns a fraction), and include any aggregation used (mean
across seeds or runs). Reference the task name tau2_bench_telecom and the
harness docs for exact JSON field names and show that the reported score should
be the aggregated pass_1 value used for comparisons.

---

Duplicate comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 260-262: The current preflight uses the command string ssh <host>
"grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null" which only
checks for any credential entry; change it to verify registry-specific
credentials by extracting registry hostnames from the selected images and
running grep for each host (e.g., grep -E "^\s*machine\s+<registryHost>\b"
~/.config/enroot/.credentials) or equivalent per-host checks over SSH; update
the code that emits the ssh grep command in SKILL.md to iterate the image
registry list and fail the preflight if any registryHost lookup returns no
match.
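The `pass_1` score extraction requested for `tau2_bench_telecom` in the prompt above might look roughly like this. It is a sketch under stated assumptions: the `"pass_1"` key and the flat dictionary layout are taken from the review comment's suggestion, not from the harness documentation, and the function names are hypothetical. Check the harness docs for the exact JSON field names before relying on it.

```python
from statistics import mean
from typing import Iterable, Mapping


def pass1_percent(results: Mapping[str, float], key: str = "pass_1") -> float:
    """Return the pass_1 score as a percentage.

    Assumption: values in [0, 1] are fractions and are multiplied by 100;
    larger values are treated as already being percentages.
    """
    value = float(results[key])
    return value * 100.0 if value <= 1.0 else value


def aggregate_pass1(run_results: Iterable[Mapping[str, float]]) -> float:
    """Mean pass_1 percentage across seeds or runs."""
    return mean(pass1_percent(r) for r in run_results)
```

Reporting the aggregated value together with the number of runs keeps baseline and quantized scores comparable.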

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 38a710dc-b0aa-4860-99a8-27d5da2f52cc

📥 Commits

Reviewing files that changed from the base of the PR and between 29bbb20 and 2580a58.

📒 Files selected for processing (13)

.claude/skills/evaluation/SKILL.md
.claude/skills/evaluation/recipes/env.example
.claude/skills/evaluation/recipes/examples/example_eval.yaml
.claude/skills/evaluation/recipes/tasks/aa_lcr.md
.claude/skills/evaluation/recipes/tasks/aime2025.md
.claude/skills/evaluation/recipes/tasks/gpqa.md
.claude/skills/evaluation/recipes/tasks/hle_aa_v2.md
.claude/skills/evaluation/recipes/tasks/ifbench.md
.claude/skills/evaluation/recipes/tasks/livecodebench.md
.claude/skills/evaluation/recipes/tasks/mmlu_pro.md
.claude/skills/evaluation/recipes/tasks/mmmu_pro.md
.claude/skills/evaluation/recipes/tasks/scicode.md
.claude/skills/evaluation/recipes/tasks/tau2_bench_telecom.md

✅ Files skipped from review due to trivial changes (6)

.claude/skills/evaluation/recipes/tasks/mmmu_pro.md
.claude/skills/evaluation/recipes/tasks/hle_aa_v2.md
.claude/skills/evaluation/recipes/tasks/ifbench.md
.claude/skills/evaluation/recipes/tasks/mmlu_pro.md
.claude/skills/evaluation/recipes/tasks/livecodebench.md
.claude/skills/evaluation/recipes/tasks/aa_lcr.md

🚧 Files skipped from review as they are similar to previous changes (4)

.claude/skills/evaluation/recipes/env.example
.claude/skills/evaluation/recipes/tasks/gpqa.md
.claude/skills/evaluation/recipes/examples/example_eval.yaml
.claude/skills/evaluation/recipes/tasks/scicode.md

shengliangxu

LGTM

cjluo-nv · 2026-05-21T05:23:59Z

+Prefer the `pass@1[avg-of-N]` metric matching the configured repeat count.
+If the repeat count is unknown, use the highest available `avg-of-N`.
+
+```python


qq why we still need these python code here?
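For context, the metric-selection rule quoted in the hunk above can be expressed as a small helper. This is an illustrative sketch, not the code from the PR: the function name `select_pass1_metric` is hypothetical, and falling back to the highest `avg-of-N` when the configured repeat count has no matching metric is an assumption beyond what the quoted text states.

```python
import re
from typing import Mapping, Optional

_AVG_OF_N = re.compile(r"^pass@1\[avg-of-(\d+)\]$")


def select_pass1_metric(
    metrics: Mapping[str, float], repeats: Optional[int] = None
) -> str:
    """Pick the pass@1[avg-of-N] metric name to report."""
    available = {}
    for name in metrics:
        match = _AVG_OF_N.match(name)
        if match:
            available[int(match.group(1))] = name
    if not available:
        raise KeyError("no pass@1[avg-of-N] metric found")
    # Prefer the metric matching the configured repeat count.
    if repeats is not None and repeats in available:
        return available[repeats]
    # Repeat count unknown (or, by assumption, unmatched): use the highest N.
    return available[max(available)]
```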

chadvoegele requested a review from a team as a code owner May 14, 2026 16:43

chadvoegele requested a review from jingyu-ml May 14, 2026 16:43

chadvoegele changed the title ~~Agent Skills Updates From Live Trials~~ WIP: Agent Skills Updates From Live Trials May 14, 2026

coderabbitai Bot reviewed May 14, 2026

Comment thread .claude/skills/monitor/SKILL.md Outdated

coderabbitai Bot reviewed May 14, 2026

coderabbitai Bot reviewed May 15, 2026

coderabbitai Bot reviewed May 18, 2026

Comment thread .claude/skills/evaluation/SKILL.md

chadvoegele force-pushed the cvoegele/agent_evals branch 2 times, most recently from 7482571 to f2b92ba, May 20, 2026 14:36

chadvoegele added 18 commits May 20, 2026 09:44 (each with Signed-off-by: Chad Voegele <cvoegele@nvidia.com>)

- 66fd103 Update evaluation skill guidance
- 8aad1eb Refine agent skill guidance
- 02ec0b2 Clarify quantized eval baseline comparison
- 267c19e Document repeat guidance for reasoning evals
- de7cd39 Add PTQ and evaluation verification guidance
- 874581c Deduplicate PTQ checkpoint size guidance
- 187ca1e Deduplicate evaluation recipe guidance
- 947074b Add SLURM QoS launcher option
- d885ad6 Make PTQ checkpoint validation a required gate
- be79555 Refine evaluation run gating guidance
- 8b6cc5f Document NEL timeout resume behavior
- 717507f Split evaluation validation and comparability steps
- 9092bc4 Convert evaluation task snippets to references
- 1b5e031 Add evaluation task references
- cce42cc Use NeMo Skills MMLU-Pro recipe
- f40898d Update evaluation task recipes
- b793ea5 Add debugging playbooks skill
- a662e43 Clarify monitor status handling

chadvoegele and others added 4 commits May 20, 2026 09:44 (each with Signed-off-by: Chad Voegele <cvoegele@nvidia.com>)

- f360752 Use ns_hle_aa for HLE AA evaluations (Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>)
- ffa7558 Scope agent state by session
- 343fe71 Add evaluation score extraction helpers
- 32c2072 Add robust monitor status parsing

chadvoegele force-pushed the cvoegele/agent_evals branch from f2b92ba to 32c2072, May 20, 2026 14:45

chadvoegele added 4 commits May 20, 2026 09:54 (each with Signed-off-by: Chad Voegele <cvoegele@nvidia.com>)

- 74635c9 Increase GPQA evaluation repeats
- 44499ba Fix markdownlint formatting in skill docs
- fd07d91 Fix launcher Slurm config typing
- 2595c72 Use launcher-compatible optional type hints

shengliangxu approved these changes May 20, 2026

cjluo-nv reviewed May 21, 2026

Pre-merge checks failed

❌ Failed checks (1 error, 1 inconclusive)

github-actions Bot commented May 14, 2026

Built to branch gh-pages at 2026-05-14 16:47 UTC. Preview will be ready when the GitHub Pages deployment is complete.

3 participants
